Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
Su, Chris, Li, Harrison, Marques, Matheus, Flint, George, Zhu, Kevin, Dev, Sunishchal
Recent work reports that Large Reasoning Models (LRMs) undergo a collapse in performance when solving puzzles beyond certain complexity thresholds. In subsequent discourse, questions have arisen as to whether the nature of the task muddles an evaluation of true reasoning. One potential confound is the requirement that the model keep track of the state space on its own. We provide a large language model (LLM) with an environment interface for Tower of Hanoi problems, allowing it to make a move with a tool call, provide written justification, observe the resulting state space, and reprompt itself for the next move. We observe that access to an environment interface does not delay or prevent performance collapse. Furthermore, LLM-parameterized policy analysis reveals increasing divergence from both optimal policies and uniformly random policies, suggesting that the model exhibits mode-like collapse at each level of complexity, and that performance depends on whether the mode reflects the correct solution for the problem. We suggest that a similar phenomenon might take place in LRMs.
- Asia > Vietnam > Hanoi > Hanoi (0.26)
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > New Jersey > Mercer County > Ewing (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
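The environment-interface setup described in the abstract can be illustrated with a minimal sketch. Nothing below is the authors' code; the class and method names (`HanoiEnv`, `move`, `legal_moves`) are hypothetical, and the "tool call" is modeled as a plain method that returns the observed state:

```python
# Hypothetical sketch of the environment interface: an agent makes a move via
# a "tool call" (here, the `move` method) and observes the resulting state.
class HanoiEnv:
    def __init__(self, n_disks=3):
        self.n = n_disks
        # Each peg is a stack of disk sizes, largest at the bottom.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def legal_moves(self):
        moves = []
        for src in range(3):
            if not self.pegs[src]:
                continue
            for dst in range(3):
                if src != dst and (not self.pegs[dst]
                                   or self.pegs[dst][-1] > self.pegs[src][-1]):
                    moves.append((src, dst))
        return moves

    def move(self, src, dst):
        """The "tool call": apply a move, then return the observed state."""
        if (src, dst) not in self.legal_moves():
            return {"ok": False, "state": self.pegs, "solved": False}
        self.pegs[dst].append(self.pegs[src].pop())
        return {"ok": True, "state": self.pegs, "solved": self.solved()}

    def solved(self):
        return len(self.pegs[2]) == self.n

def optimal_moves(n, src=0, aux=1, dst=2):
    """The classic 2^n - 1 move optimal policy, for reference."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n - 1, aux, src, dst))
```

Driving the environment with `optimal_moves` reproduces the observe-then-act loop; in the study, an LLM agent instead chooses each move from the observed state.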
A novel sensitivity analysis method for agent-based models stratifies in-silico tumor spheroid simulations
Rohr, Edward H., Nardini, John T.
Agent-based models (ABMs) are widely used in biology to understand how individual actions scale into emergent population behavior. Modelers employ sensitivity analysis (SA) algorithms to quantify input parameters' impact on model outputs; however, SA is hard to perform for ABMs because of their computational cost and complexity. In this work, we develop the Simulate, Summarize, Reduce, Cluster, and Analyze (SSRCA) methodology, a machine-learning-based pipeline designed to facilitate SA for ABMs. In particular, SSRCA can achieve the following tasks for ABMs: 1) identify sensitive model parameters, 2) reveal common model output patterns, and 3) determine which input parameter values generate these patterns. We use an example ABM of tumor spheroid growth to showcase how SSRCA provides similar SA results to the popular Sobol' method while also identifying four common patterns from the ABM and the parameter regions that generate these outputs. This analysis could streamline data-driven tasks, such as parameter estimation, for ABMs by reducing the parameter space. While we highlight these results with an ABM of tumor spheroid formation, the SSRCA methodology is broadly applicable to biological ABMs.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > New Jersey > Mercer County > Ewing (0.14)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Energy (0.67)
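The five SSRCA stages can be sketched end to end on a toy stand-in for an ABM. This is an illustration of the pipeline's shape under stated assumptions, not the authors' implementation: the "ABM" here is just a logistic growth curve, and the summary statistics, PCA reduction, and k-means clustering are generic placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Simulate: toy stand-in for an ABM -- logistic growth curves
#    parameterized by a growth rate r and a carrying capacity K.
t = np.linspace(0, 10, 50)
def simulate(r, K):
    return K / (1 + (K - 1) * np.exp(-r * t))

params = rng.uniform([0.2, 5.0], [2.0, 50.0], size=(200, 2))  # (r, K) samples
outputs = np.array([simulate(r, K) for r, K in params])

# 2) Summarize: per-simulation summary statistics.
summaries = np.column_stack([outputs[:, -1],                        # final size
                             outputs[:, 25],                        # mid-time size
                             np.gradient(outputs, axis=1).max(1)])  # peak growth

# 3) Reduce: PCA (via SVD) on the standardized summaries.
Z = (summaries - summaries.mean(0)) / summaries.std(0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
reduced = Z @ Vt[:2].T

# 4) Cluster: a few iterations of plain k-means, k = 2.
k = 2
centers = reduced[rng.choice(len(reduced), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((reduced[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([reduced[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

# 5) Analyze: which parameter regions generate each output pattern?
for j in range(k):
    members = params[labels == j]
    if len(members):
        print(f"cluster {j}: mean (r, K) = {members.mean(0).round(2)}")
```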
NJ lawmaker calls on Department of Defense to 'immediately' probe mystery drone sightings
New Jersey State Senator Jon Bramnick joins'America Reports' to discuss recent mysterious drone sightings in New Jersey. A New Jersey state senator is calling on the Department of Defense to investigate the recent mysterious nighttime drone sightings amid rising public frustration over a lack of answers. "Let me be clear: The state police, this is way beyond their expertise … We know the Department of Defense has the technology to monitor these drones," State Sen. Jon Bramnick, R-N.J., told co-anchor John Roberts Wednesday on "America Reports." "The problem is we don't have the Department of Defense in New Jersey at this time. And that's what I call for. Until the Department of Defense comes in, shuts down airspace completely to drones, do a limited state of emergency – no drones in the sky until we figure out what's going on here," Bramnick warned.
- North America > United States > New Jersey > Monmouth County (0.07)
- North America > United States > New Jersey > Ocean County (0.05)
- North America > United States > New Jersey > Mercer County > Ewing (0.05)
- North America > United States > New Jersey > Essex County > Newark (0.05)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
TrajDiffuse: A Conditional Diffusion Model for Environment-Aware Trajectory Prediction
Liu, Qingze, Li, Danrui, Sohn, Samuel S., Yoon, Sejong, Kapadia, Mubbasir, Pavlovic, Vladimir
Accurate prediction of human or vehicle trajectories with good diversity that captures their stochastic nature is an essential task for many applications. However, many trajectory prediction models produce unreasonable trajectory samples that focus on improving diversity or accuracy while neglecting other key requirements, such as collision avoidance with the surrounding environment. In this work, we propose TrajDiffuse, a planning-based trajectory prediction method using a novel guided conditional diffusion model. We formulate the trajectory prediction problem as a denoising inpainting task and design a map-based guidance term for the diffusion process. TrajDiffuse is able to generate trajectory predictions that match or exceed the accuracy and diversity of the SOTA, while adhering almost perfectly to environmental constraints. We demonstrate the utility of our model through experiments on the nuScenes and PFSD datasets and provide an extensive benchmark analysis against the SOTA methods.
- North America > United States > New Jersey > Mercer County > Ewing (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
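The map-based guidance idea, steering an iterative denoising process with the gradient of an environment penalty, can be sketched in toy form. This is not the TrajDiffuse model: the "denoiser" below is just a pull toward a straight-line prior, and the obstacle, step sizes, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A noisy trajectory of T waypoints between a fixed start and goal.
T = 32
line = np.linspace([0.0, 0.0], [10.0, 0.0], T)   # straight-line prior
xs = line + rng.normal(0, 1.0, size=(T, 2))

center, R = np.array([5.0, 0.0]), 1.0            # circular no-go map region

def guidance(x):
    """Gradient of a map penalty: push waypoints out of the obstacle."""
    d = x - center
    dist = np.linalg.norm(d, axis=1, keepdims=True) + 1e-9
    inside = (dist < R).astype(float)
    return inside * (R - dist) * d / dist

# Iterative refinement: a "denoising" pull toward the prior, plus guidance.
for _ in range(300):
    xs = xs + 0.1 * (line - xs) + guidance(xs)
```

After refinement the trajectory hugs the prior while respecting the no-go region; in the real model the hand-written pull toward the prior is replaced by a learned diffusion denoiser.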
Automatic Logical Forms improve fidelity in Table-to-Text generation
Table-to-text systems generate natural language statements from structured data like tables. While end-to-end techniques suffer from low factual correctness (fidelity), a previous study reported gains when using manual logical forms (LF) that represent the selected content and the semantics of the target text. Given the manual step, it was not clear whether automatic LFs would be effective, or whether the improvement came from content selection alone. We present TlT which, given a table and a selection of the content, first produces LFs and then the textual statement. We show for the first time that automatic LFs improve quality, with an increase in fidelity of 30 points over a comparable system not using LFs. Our experiments allow us to quantify the remaining challenges for high factual correctness, with automatic selection of content coming first, followed by better Logic-to-Text generation and, to a lesser extent, better Table-to-Logic parsing.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria (0.04)
- Oceania > Fiji (0.04)
- (31 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
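The role of a logical form as an executable intermediate between table and text can be illustrated with a toy example. This is not the TlT system; the LF schema and `realize` wording below are hypothetical, but they show why executing the LF against the table keeps the stated number faithful to the data.

```python
# A tiny table and a hypothetical logical-form schema: ("count", column, op, value).
table = [
    {"team": "Eagles", "wins": 11},
    {"team": "Giants", "wins": 6},
    {"team": "Jets", "wins": 7},
]

def build_lf(column, threshold):
    return ("count", column, ">", threshold)

def execute_lf(lf, table):
    _, col, _, thr = lf
    return sum(1 for row in table if row[col] > thr)

def realize(lf, table):
    """Realize the LF as text; the stated count is *computed*, not generated."""
    _, col, op, thr = lf
    wording = {">": "more than"}
    return f"{execute_lf(lf, table)} teams have {wording[op]} {thr} {col}."

lf = build_lf("wins", 6)
print(realize(lf, table))
```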
Early Forecasting of Text Classification Accuracy and F-Measure with Active Learning
Orth, Thomas, Bloodgood, Michael
When creating text classification systems, one of the major bottlenecks is the annotation of training data. Active learning has been proposed to address this bottleneck using stopping methods to minimize the cost of data annotation. An important capability for improving the utility of stopping methods is to effectively forecast the performance of the text classification models. Forecasting can be done through the use of logarithmic models regressed on some portion of the data as learning is progressing. A critical unexplored question is what portion of the data is needed for accurate forecasting. There is a tension: it is desirable to use less data so that the forecast can be made earlier, which is more useful, but it is also desirable to use more data so that the forecast can be more accurate. We find that when using active learning it is even more important to generate forecasts earlier so as to make them more useful and not waste annotation effort. We investigate the difference in forecasting difficulty when using accuracy and F-measure as the text classification system performance metrics, and we find that F-measure is more difficult to forecast. We conduct experiments on seven text classification datasets in different semantic domains with different characteristics and with three different base machine learning algorithms. We find that forecasting is easiest for decision tree learning, moderate for Support Vector Machines, and most difficult for neural networks.
- North America > United States > New Jersey > Mercer County > Ewing (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- (11 more...)
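The logarithmic-model forecasting described above can be sketched as an ordinary least-squares fit of accuracy against the log of the amount of labeled data, using only an early portion of a (here synthetic) learning curve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic learning curve: accuracy grows logarithmically with labeled data.
x = np.arange(100, 2100, 100)                       # labeled examples so far
true_curve = 0.55 + 0.06 * np.log(x / 100.0)
y = true_curve + rng.normal(0, 0.005, size=x.size)  # observed noisy accuracy

# Fit y = a + b * log(x) on only the first 8 of 20 iterations.
portion = 8
A = np.column_stack([np.ones(portion), np.log(x[:portion])])
coef, *_ = np.linalg.lstsq(A, y[:portion], rcond=None)

# Forecast accuracy at the final iteration from the early fit.
forecast = coef[0] + coef[1] * np.log(x[-1])
print(f"forecast: {forecast:.3f}, actual: {y[-1]:.3f}")
```

The earlier the cutoff (`portion`), the sooner the forecast is available but the larger its extrapolation error, which is exactly the tension the abstract describes.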
Stopping Active Learning based on Predicted Change of F Measure for Text Classification
Altschuler, Michael, Bloodgood, Michael
During active learning, an effective stopping method allows users to limit the number of annotations, which is cost effective. In this paper, a new stopping method called Predicted Change of F Measure is introduced that attempts to provide users with an estimate of how much the performance of the model is changing at each iteration. This stopping method can be applied with any base learner. This method is useful for reducing the data annotation bottleneck encountered when building text classification systems. I. INTRODUCTION The use of active learning to train machine learning models has been used as a way to reduce annotation costs for text and speech processing applications [1], [2], [3], [4], [5]. Active learning has been shown to have a particularly large potential for reducing annotation cost for text classification [6], [7]. Text classification is one of the most important fields in semantic computing and has been used in many applications [8], [9], [10], [11], [12]. A. Active Learning Active learning is a form of machine learning that gives the model the ability to select the data from which it learns and to choose when to end the process of training. In active learning, the model is first provided a small batch of annotated data to be trained on. Then, in each following iteration, the model selects a small batch from a large unlabeled set of examples and removes this batch from that set.
- North America > United States > New Jersey > Mercer County > Ewing (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (10 more...)
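The active-learning loop described in the introduction, seed set, train, select an uncertain batch, annotate, repeat, can be sketched with uncertainty sampling and a simple logistic-regression base learner. The data, batch size, and learner below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary text-classification stand-in: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

def train(Xl, yl, steps=500, lr=0.1):
    """Logistic regression by gradient descent (the base learner)."""
    Xb = np.column_stack([Xl, np.ones(len(Xl))])
    w = np.zeros(3)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - yl) / len(yl)
    return w

def predict_proba(w, Xq):
    return 1 / (1 + np.exp(-np.column_stack([Xq, np.ones(len(Xq))]) @ w))

# Seed with a small annotated batch, then iterate: train, pick the most
# uncertain unlabeled examples, "annotate" them, and continue.
labeled = list(rng.choice(len(X), 10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):                                   # 5 iterations, batch size 10
    w = train(X[labeled], y[labeled])
    p = predict_proba(w, X[unlabeled])
    uncertain = np.argsort(np.abs(p - 0.5))[:10]     # closest to the boundary
    batch = [unlabeled[i] for i in uncertain]
    labeled += batch
    unlabeled = [i for i in unlabeled if i not in batch]
```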
The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification
Beatty, Garrett, Kochis, Ethan, Bloodgood, Michael
Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique for reducing the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources for informing when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data. I. INTRODUCTION The use of active learning to train machine learning models has been used as a way to reduce annotation costs for text and speech processing applications [1], [2], [3], [4], [5]. Active learning has been shown to have a particularly large potential for reducing annotation cost for text classification [6], [7]. Text classification is one of the most important fields in semantic computing and has been used in many applications [8], [9], [10], [11], [12].
- North America > United States > New Jersey > Mercer County > Ewing (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (10 more...)
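A stopping rule driven by unlabeled data can be sketched as checking whether successive models' predictions on a held-out unlabeled pool have stabilized. The agreement threshold and window below are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np

def should_stop(history, threshold=0.99, window=3):
    """Stop once the last `window` successive agreements all clear the threshold."""
    if len(history) <= window:
        return False
    pairs = range(len(history) - window - 1, len(history) - 1)
    return all(np.mean(history[i] == history[i + 1]) >= threshold for i in pairs)

# Simulated predictions on an unlabeled pool, stabilizing over iterations.
rng = np.random.default_rng(0)
base = rng.integers(0, 2, 1000)
history = []
for noise in [0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0]:
    p = base.copy()
    p[rng.random(1000) < noise] ^= 1   # early models disagree more
    history.append(p)
```

The appeal of this family of methods is that the pool never needs labels: only agreement between consecutive models is measured.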
The History of Robots: From the 400 BC Archytas to the Boston Dynamics' Robot Dog
Robots have fascinated and preoccupied human minds for centuries - from ancient tales of stone golems, to modern science fiction. Though the word "robot" was only officially penned in 1921 by Karel Čapek, mankind has endeavored to create autonomous machines since as far back as the 4th Century BCE. Today, robots are widely used across a variety of industries, aiding in the manufacturing of vehicles and more. According to the International Federation of Robotics, in 2015 there were as many as 1.63 million industrial robots in operation worldwide, and that number continues to grow steadily each year. Here's a brief history of how robotics has evolved and grown from the early imaginings of 400 BCE, to the global resource it is today. The earliest beginnings of robotics can be traced back to Ancient Greece. Aristotle was one of the first great thinkers to consider automated tools, and how these tools would affect society at large.
- Europe > Greece (0.24)
- North America > United States > New Jersey > Mercer County > Ewing (0.04)
- Europe > Germany (0.04)
- (2 more...)
- Health & Medicine (0.70)
- Government (0.69)
- Leisure & Entertainment > Games > Chess (0.32)
Impact of Batch Size on Stopping Active Learning for Text Classification
Beatty, Garrett, Kochis, Ethan, Bloodgood, Michael
When using active learning, smaller batch sizes are typically more efficient from a learning efficiency perspective. However, in practice due to speed and human annotator considerations, the use of larger batch sizes is necessary. While past work has shown that larger batch sizes decrease learning efficiency from a learning curve perspective, it remains an open question how batch size impacts methods for stopping active learning. We find that large batch sizes degrade the performance of a leading stopping method over and above the degradation that results from reduced learning efficiency. We analyze this degradation and find that it can be mitigated by changing the window size parameter of how many past iterations of learning are taken into account when making the stopping decision. We find that when using larger batch sizes, stopping methods are more effective when smaller window sizes are used.
- North America > United States > New Jersey > Mercer County > Ewing (0.17)
- North America > United States > California > Orange County > Laguna Hills (0.15)
- North America > United States > Colorado > Boulder County > Boulder (0.05)
- (4 more...)
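The window-size parameter discussed above can be made concrete with a minimal sketch: a stopping rule that fires when some model-evaluation signal has changed little over the last `window` iterations. The signal, threshold, and values are illustrative:

```python
def should_stop(signal, window=3, epsilon=0.005):
    """Stop when the signal changed by less than epsilon across the window."""
    if len(signal) < window + 1:
        return False
    recent = signal[-(window + 1):]
    return max(recent) - min(recent) < epsilon

# A learning curve that plateaus; the rule fires only once it flattens out.
curve = [0.60, 0.70, 0.76, 0.80, 0.82, 0.828, 0.830, 0.831, 0.8312]
stop_at = next(i for i in range(len(curve)) if should_stop(curve[:i + 1]))
print(f"stop after iteration {stop_at}")  # a smaller window would fire sooner
```

With large batches, each iteration spans many annotations, so shrinking `window` (as the abstract suggests) lets the rule react before extra batches are wasted.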